Proteomics Data from Murine Blood Cells

A Collaborative Data Science Project by Group 6

Introduction to the data

  • Unpublished data from research group at DTU

  • Proteomics data showing differentiation trajectories in murine hematopoiesis

  • Intensity measured by mass spectrometry is used as a measure of protein expression

  • Objective: “Examine differences in protein expression across the differentiation trajectory of the different cell types”

Data tidying and data format

  • 5 dataframes joined using full_join()

  • Variable names are simplified

  • Each sample was its own variable, so we used pivot_longer() to get a single intensity column

    Cell type                     HSC    CMP    GMP    CLP    MEP   Total
    No. of protein groups        2196   1986   2291   2382   2691    3765
    No. of replicates (samples)     4      4      4      5      4      21
    # A tibble: 4 × 5
      protein_groups       genes              cell_type replicate_n intensity
      <chr>                <chr>              <chr>           <dbl>     <dbl>
    1 O35286               Dhx15              mep                 1     1.09 
    2 P46467;Q8BPY9;Q8VEJ9 Vps4b;Fignl1;Vps4a mep                 1     0.393
    3 P70698               Ctps1              cmp                 3     0.309
    4 O08795;O08795-2      Prkcsh             cmp                 2     0.921
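
The tidying steps listed above can be illustrated with a minimal sketch. The two toy tables, their column names (hsc_1, hsc_2, mep_1) and their values are hypothetical stand-ins, not the project data; only full_join() and pivot_longer() are taken from the slide.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy inputs: one wide table per cell type, one column per sample
df_hsc <- data.frame(
  protein_groups = c("O35286", "P70698"),
  genes          = c("Dhx15", "Ctps1"),
  hsc_1          = c(1.2e6, 3.4e5),
  hsc_2          = c(1.1e6, NA)
)
df_mep <- data.frame(
  protein_groups = c("O35286", "O08795;O08795-2"),
  genes          = c("Dhx15", "Prkcsh"),
  mep_1          = c(9.8e5, 4.1e5)
)

# full_join() keeps protein groups that occur in either table
df_wide <- full_join(df_hsc, df_mep, by = c("protein_groups", "genes"))

# pivot_longer() turns every sample column into rows; the column name is
# split at "_" into cell type and replicate number
df_long <- df_wide |>
  pivot_longer(
    cols      = -c(protein_groups, genes),
    names_to  = c("cell_type", "replicate_n"),
    names_sep = "_",
    values_to = "intensity"
  ) |>
  mutate(replicate_n = as.numeric(replicate_n))
```

Protein groups that are missing from one table end up with NA intensities for that table's samples, which is why a later imputation or filtering step is needed.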

Augmenting the data

  • log2 transformation


  • Normalization


  • Missing value imputation
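
The slide does not state which normalization or imputation method was used, so the following is only a sketch under assumptions: normalization is taken to be subtraction of each sample's median log2 intensity, and imputation is taken to be replacement of missing values with the sample minimum. The toy data and the column names log2_intensity, norm_intensity and imputed are hypothetical.

```r
library(dplyr)

# Hypothetical long-format input with one missing intensity
df_long <- data.frame(
  protein_groups = rep(c("A", "B", "C"), times = 2),
  cell_type      = rep(c("hsc", "mep"), each = 3),
  replicate_n    = 1,
  intensity      = c(1024, 4096, NA, 2048, 8192, 512)
)

df_aug <- df_long |>
  # 1. log2 transformation
  mutate(log2_intensity = log2(intensity)) |>
  # Work sample by sample (one sample = one cell type and replicate)
  group_by(cell_type, replicate_n) |>
  # 2. Normalization (assumed: subtract the sample median)
  mutate(norm_intensity = log2_intensity - median(log2_intensity, na.rm = TRUE)) |>
  # 3. Imputation (assumed: replace NA with the sample minimum)
  mutate(
    imputed        = is.na(norm_intensity),
    norm_intensity = if_else(imputed,
                             min(norm_intensity, na.rm = TRUE),
                             norm_intensity)
  ) |>
  ungroup()
```

Keeping the imputed flag makes it possible to check later whether a result depends mainly on imputed values.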





Augmenting data for principal component analysis

# Pivot to wider format for principal component analysis
df_feature_wide <- df_sample_wise |>
  mutate(intensity = unname(intensity)) |>
  pivot_wider(id_cols = c(cell_type, replicate_n),
              names_from = 'protein_groups',
              values_from = 'intensity') |>
  select(where(~ !any(is.na(.)))) # Keep only columns without missing values

# Get numerical inputs
df_input <- df_feature_wide |>
  select(-c('replicate_n', 'cell_type'))

# Group data points by cell type
colourby <- pull(df_feature_wide, 'cell_type')

# Project data onto principal components (pc)
pc <- prcomp(df_input, center = FALSE, scale. = FALSE)
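
As a sketch of how the prcomp() result might be turned into something plottable, the example below extracts the sample scores and the share captured by each component. The random matrix is a hypothetical stand-in for df_input, and df_scores and var_explained are illustrative names. Note that with center = FALSE the result only matches a conventional PCA if the columns are already centred.

```r
# Hypothetical stand-in for df_input: 6 samples x 4 protein groups
set.seed(1)
df_input <- as.data.frame(matrix(rnorm(24), nrow = 6))
colourby <- rep(c("hsc", "mep"), each = 3)

pc <- prcomp(df_input, center = FALSE, scale. = FALSE)

# Scores: one row per sample, one column per principal component
df_scores <- data.frame(pc$x, cell_type = colourby)

# Share captured by each component; with center = FALSE this is the share
# of the total sum of squares around zero rather than of the variance
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
```

Plotting PC1 against PC2 from df_scores, coloured by cell_type, gives the usual PCA scatter plot.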

PCA Results

Volcano - augmentation

  • Calculation of mean value and standard deviation per protein group and cell type
  • Switching to a wider dataframe (better for computing fold change and t-test)
  • Creating a new dataframe using a function
# A tibble: 5 × 11
  protein_groups      mean_clp mean_cmp mean_gmp mean_hsc mean_mep sd_clp sd_cmp
  <chr>                  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
1 Q9D7N9                 0.930    0.939    0.958    1.75     0.974 0.0201 0.0244
2 A8C756;A8C756-2;A8…    0.319    0.334    0.350    0.606    0.394 0.0284 0.0190
3 O70152                 0.785    0.906    0.730    0.774    0.879 0.248  0.0327
4 Q6P9P6                 0.822    0.608    0.885    0.662    0.905 0.266  0.336 
5 Q8K2D3                 0.319    0.334    0.350    0.725    0.394 0.0284 0.0190
# ℹ 3 more variables: sd_gmp <dbl>, sd_hsc <dbl>, sd_mep <dbl>
volcano_augment <- function(df, later_cell, earlier_cell, n_later, n_earlier){

. . .
  return(data_set_for_visualisation)
}
# A tibble: 5 × 4
  protein_groups fold_log2   q_val expression     
  <chr>              <dbl>   <dbl> <chr>          
1 P11688            0.0432 1       not significant
2 Q3U7R1            1.58   0.0499  overexpressed  
3 Q8BIQ5            1.50   0.00265 overexpressed  
4 Q99KR3            0.0432 1       not significant
5 Q9D1G3            0.614  1       not significant
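
The body of volcano_augment() is not shown on the slide. The sketch below is one way such a function could produce the fold_log2, q_val and expression columns from per-cell-type means and standard deviations. It is not the authors' implementation: the Welch t-test from summary statistics, the Benjamini-Hochberg correction, the 0.05 cut-off, the fold change defined as log2 of the ratio of means, and the "underexpressed" label are all assumptions. The toy input values are hypothetical.

```r
# Hypothetical sketch; mirrors the argument names of volcano_augment()
volcano_augment_sketch <- function(df, later_cell, earlier_cell,
                                   n_later, n_earlier) {
  m_l <- df[[paste0("mean_", later_cell)]]
  m_e <- df[[paste0("mean_", earlier_cell)]]
  # Squared standard errors of the two means
  v_l <- df[[paste0("sd_", later_cell)]]^2 / n_later
  v_e <- df[[paste0("sd_", earlier_cell)]]^2 / n_earlier

  # Welch t-test from summary statistics (assumed test)
  t_stat <- (m_l - m_e) / sqrt(v_l + v_e)
  dof    <- (v_l + v_e)^2 /
    (v_l^2 / (n_later - 1) + v_e^2 / (n_earlier - 1))
  p_val  <- 2 * pt(-abs(t_stat), df = dof)

  # Assumed definitions of fold change, correction and labels
  fold_log2  <- log2(m_l / m_e)
  q_val      <- p.adjust(p_val, method = "BH")
  expression <- ifelse(q_val >= 0.05, "not significant",
                       ifelse(fold_log2 > 0, "overexpressed", "underexpressed"))

  data.frame(protein_groups = df$protein_groups, fold_log2, q_val, expression)
}

# Hypothetical summary table with three protein groups
df_summary <- data.frame(
  protein_groups = c("P1", "P2", "P3"),
  mean_hsc = c(1, 1.00, 2), sd_hsc = c(0.1, 0.5, 0.1),
  mean_mep = c(2, 1.05, 1), sd_mep = c(0.1, 0.5, 0.1)
)
res <- volcano_augment_sketch(df_summary, "mep", "hsc", n_later = 4, n_earlier = 4)
```

Working from means, standard deviations and replicate counts avoids calling t.test() once per protein group, at the cost of writing out the test by hand.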

Volcano - visualisation

  • Plotting log2(fold change) against -log10(q-value) is a standard way to visualise differential expression in proteomics
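
A minimal ggplot2 sketch of such a plot, using four of the rows printed on the previous slide as input. The dashed line at q = 0.05 is an assumed significance threshold, and df_volcano and p are illustrative names.

```r
library(ggplot2)

# Rows copied from the example output of the augmentation step
df_volcano <- data.frame(
  protein_groups = c("P11688", "Q3U7R1", "Q8BIQ5", "Q9D1G3"),
  fold_log2      = c(0.0432, 1.58, 1.50, 0.614),
  q_val          = c(1, 0.0499, 0.00265, 1),
  expression     = c("not significant", "overexpressed",
                     "overexpressed", "not significant")
)

# Effect size on the x-axis, significance on the y-axis
p <- ggplot(df_volcano,
            aes(x = fold_log2, y = -log10(q_val), colour = expression)) +
  geom_point() +
  # Assumed threshold: q = 0.05
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  labs(x = "log2(fold change)", y = "-log10(q-value)")
```

Taking -log10 of the q-value places the most significant protein groups at the top of the plot.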

Uniprot lookup

uniprot_lookup <- function(gene_id, dataframe, id_column, keyword_column){
  # does lookup in uniprot df and returns Keywords
  return(Keywords)
}

df_type <- df_norm_intensities |> 
  filter(!is.na(genes) & genes != "") |> 
  mutate(description_column = map_chr(
     .x = genes,
     .f = ~uniprot_lookup(gene_id=.x, 
                          dataframe=df_uniprot_mouse, 
                          id_column=`Gene Names`, keyword_column=Keywords)
   ))
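
The lookup body is only summarised in a comment on the slide. The sketch below shows one possible lookup, not the authors' implementation. Unlike the call above, it takes the column names as strings. It assumes that gene names within a cell are separated by spaces and that only the first gene of a ";"-separated entry is looked up. The toy table and its keyword values are placeholders.

```r
# Hypothetical stand-in for df_uniprot_mouse with placeholder keywords
df_uniprot_toy <- data.frame(
  `Gene Names` = c("Dhx15 Ddx15", "Ctps1 Ctps"),
  Keywords     = c("keyword A;keyword B", "keyword C"),
  check.names  = FALSE
)

uniprot_lookup_sketch <- function(gene_id, dataframe, id_column, keyword_column) {
  # Use only the first gene when several are separated by ";" (assumption)
  first_gene <- strsplit(gene_id, ";", fixed = TRUE)[[1]][1]

  # Exact match against the space-separated gene names of each row (assumption)
  gene_lists <- strsplit(dataframe[[id_column]], " ", fixed = TRUE)
  is_hit <- vapply(gene_lists,
                   function(gene_names) first_gene %in% gene_names,
                   logical(1))

  # Always return exactly one string, as map_chr() requires
  hits <- dataframe[[keyword_column]][is_hit]
  if (length(hits) == 0) NA_character_ else hits[1]
}

found     <- uniprot_lookup_sketch("Dhx15", df_uniprot_toy, "Gene Names", "Keywords")
not_found <- uniprot_lookup_sketch("Vps4b;Fignl1;Vps4a", df_uniprot_toy,
                                   "Gene Names", "Keywords")
```

Returning NA_character_ for genes without a match keeps the function safe to use inside map_chr(), which fails if a call returns anything other than a single string.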

Discussion

Data may take many shapes and forms

  • The tidyverse is well structured, but can also be restrictive.

  • Some necessary functions, such as t.test(), come from base R rather than the tidyverse, which makes them harder to fit into a tidy pipeline.

Principal Component Analysis showed cell differentiation

  • The PCA in this case can be used to identify the cell differentiation pattern

  • This is in line with what has been presented in the literature for this data.

Volcano plot and lookup to find overexpressed proteins

  • Through the volcano plots we can filter for and identify the significantly overexpressed proteins, which can then be looked up and studied for biological significance.

  • While we do not draw any high-level biological conclusions, we showcase the possibility of using the tidyverse to extract biological information.

Conclusion

Despite many internal frustrations along the way, we have been able to…

  • Create code that loads, tidies, transforms and visualises data containing about 3,700 protein groups across 5 cell types, and extracts biological meaning from it.

  • Create 2 functions that build dataframes for volcano plots, and 1 lookup function that annotates protein groups with keywords.

  • Succeed in the project's main intent, which is to showcase a pipeline for understanding cell differentiation.

  • Outline future work: more observations and cell types could be included to increase the resolution, and the overall pipeline can be reused for other cell types, for example human cells.